Assessment & Standards
Amazon's AI-generated summary of popular conservative book accuses it of 'extreme' rhetoric
Markowicz previously explained why they wrote the book in a Fox News Digital opinion piece, noting that in 2021, then-Democratic Virginia gubernatorial candidate Terry McAuliffe said, "I don't think parents should be telling schools what they should teach." "Taken on its own, the comment might even be benign. Sure, parental involvement in education had always been a predictor of student success. A 2010 study called 'Parent Involvement and Student Academic Performance: A Multiple Mediational Analysis,' by researchers at the Warren Alpert Medical School of Brown University and the University of North Carolina at Greensboro, found 'children whose parents are more involved in their education have higher levels of academic performance than children whose parents are involved to a lesser degree.'" But should parents be designing a curriculum?
Investigating Recent Large Language Models for Vietnamese Machine Reading Comprehension
Nguyen, Anh Duc, Phi, Hieu Minh, Ngo, Anh Viet, Trieu, Long Hai, Nguyen, Thai Phuong
Large Language Models (LLMs) have shown remarkable proficiency in Machine Reading Comprehension (MRC) tasks; however, their effectiveness for low-resource languages like Vietnamese remains largely unexplored. In this paper, we fine-tune and evaluate two state-of-the-art LLMs, Llama 3 (8B parameters) and Gemma (7B parameters), on ViMMRC, a Vietnamese MRC dataset. By utilizing Quantized Low-Rank Adaptation (QLoRA), we efficiently fine-tune these models and compare their performance against powerful LLM-based baselines. Although our fine-tuned models are smaller than GPT-3 and GPT-3.5, they outperform both traditional BERT-based approaches and these larger models. This demonstrates the effectiveness of our fine-tuning process, showcasing how modern LLMs can surpass the capabilities of older models like BERT while remaining suitable for deployment in resource-constrained environments. Through intensive analyses, we explore various aspects of model performance, providing valuable insights into adapting LLMs for low-resource languages like Vietnamese. Our study contributes to the advancement of natural language processing in low-resource languages, and we make our fine-tuned models publicly available at: https://huggingface.co/iaiuet.
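For readers curious what a QLoRA setup like this might look like in practice, here is a minimal sketch using the Hugging Face transformers and peft libraries. The model IDs are the public Llama 3 and Gemma checkpoints, but the LoRA rank, target modules, and other hyperparameters are illustrative assumptions, not the paper's reported settings.

```python
# Minimal QLoRA fine-tuning sketch (hyperparameters are illustrative,
# not the paper's settings). Requires: transformers, peft, bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"  # or "google/gemma-7b"

# 4-bit NF4 quantization keeps the frozen base model small in GPU memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb)
tokenizer = AutoTokenizer.from_pretrained(base)

# Low-rank adapters are the only trainable parameters.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Supervised fine-tuning on ViMMRC prompts (passage + question -> answer)
# would follow, e.g. with transformers.Trainer or trl's SFTTrainer.
```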
Texas private school's use of new 'AI tutor' rockets student test scores to top 2% in the country
Alpha School co-founder Mackenzie Price and a junior at the school, Elle Kristine, join 'Fox & Friends' to discuss the benefits of incorporating artificial intelligence into the classroom. A Texas private school is seeing student test scores soar to new heights following the implementation of an artificial intelligence (AI) "tutor." At Alpha School in Austin, Texas, students are placed in the classroom for two hours a day with an AI assistant, using the rest of the day to focus on skills like public speaking, financial literacy, and teamwork. "We use an AI tutor and adaptive apps to provide a completely personalized learning experience for all of our students, and as a result our students are learning faster, they're learning way better. In fact, our classes are in the top 2% in the country," Alpha School co-founder Mackenzie Price told "Fox & Friends." Will AI make schools 'obsolete,' or does it present a new 'opportunity' for the education system?
Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection
Qwaider, Chatrine, Alhafni, Bashar, Chirkunov, Kirill, Habash, Nizar, Briscoe, Ted
Automated Essay Scoring (AES) plays a crucial role in assessing language learners' writing quality, reducing grading workload, and providing real-time feedback. Arabic AES systems are particularly challenged by the lack of annotated essay datasets. This paper presents a novel framework leveraging Large Language Models (LLMs) and Transformers to generate synthetic Arabic essay datasets for AES. We prompt an LLM to generate essays across CEFR proficiency levels and introduce controlled error injection using a fine-tuned Standard Arabic BERT model for error type prediction. Our approach produces realistic human-like essays, contributing a dataset of 3,040 annotated essays. Additionally, we develop a BERT-based auto-marking system for accurate and scalable Arabic essay evaluation. Experimental results demonstrate the effectiveness of our framework in improving Arabic AES performance.
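As a rough illustration of the controlled error-injection step, the sketch below corrupts a clean essay token by token according to a predicted error type. The error taxonomy and the predict_error_type callable are hypothetical stand-ins for the paper's fine-tuned Arabic BERT error-type predictor.

```python
import random

# Hypothetical error taxonomy; the paper's actual labels and fine-tuned
# Arabic BERT error-type predictor are not reproduced here.
def inject_errors(tokens, predict_error_type, rate=0.1):
    """Corrupt a clean essay by applying a predicted error type per token."""
    corrupted = []
    for tok in tokens:
        if random.random() < rate:
            err = predict_error_type(tok)  # e.g. a BERT classifier head
            if err == "delete":
                continue                   # drop the token entirely
            elif err == "swap" and corrupted:
                corrupted.insert(-1, tok)  # swap token with its predecessor
                continue
            elif err == "char_sub" and len(tok) > 1:
                i = random.randrange(len(tok))     # substitute one character
                tok = tok[:i] + random.choice("ابتد") + tok[i + 1:]
        corrupted.append(tok)
    return corrupted
```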
Efficient multi-prompt evaluation of LLMs
Weber, Lucas
Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts that borrows strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry; for example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Moreover, we show how PromptEval can be useful in LLM-as-a-judge and best prompt identification applications.
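The borrowing-strength idea can be sketched with a toy Rasch-style model: fit additive prompt and example effects on a small observed subset of the prompt-by-example accuracy matrix, then estimate every prompt's accuracy and read off quantiles. This is an illustrative stand-in, not the authors' implementation, and the data below are random placeholders.

```python
import numpy as np

# Toy sketch of borrowing strength across prompts and examples:
# P(correct | prompt i, example j) = sigmoid(a_i - b_j), fit on a
# sparse observed subset, then used to estimate all prompts' accuracy.
rng = np.random.default_rng(0)
P, E = 100, 500                       # prompt templates x benchmark examples
obs = rng.random((P, E)) < 0.02       # sparse evaluation budget (~2%)
truth = rng.random((P, E)) < 0.6      # placeholder for observed correctness

a = np.zeros(P)                       # prompt "ability"
b = np.zeros(E)                       # example "difficulty"
for _ in range(500):                  # plain gradient ascent on the log-lik
    p = 1 / (1 + np.exp(-(a[:, None] - b[None, :])))
    g = np.where(obs, truth - p, 0.0)
    a += 0.5 * g.sum(axis=1) / np.maximum(obs.sum(axis=1), 1)
    b -= 0.5 * g.sum(axis=0) / np.maximum(obs.sum(axis=0), 1)

est = (1 / (1 + np.exp(-(a[:, None] - b[None, :])))).mean(axis=1)
print("median:", np.quantile(est, 0.5), "5th pct:", np.quantile(est, 0.05))
```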
Generating Correct Answers for Progressive Matrices Intelligence Tests
Raven's Progressive Matrices are multiple-choice intelligence tests, where one tries to complete the missing location in a 3×3 grid of abstract images. Previous attempts to address this test have focused solely on selecting the right answer out of the multiple choices. In this work, we focus, instead, on generating a correct answer given the grid, without seeing the choices, which is a harder task, by definition. The proposed neural model combines multiple advances in generative models, including employing multiple pathways through the same network, using the reparameterization trick along two pathways to make their encoding compatible, a dynamic application of variational losses, and a complex perceptual loss that is coupled with a selective backpropagation procedure. Our algorithm not only generates a set of plausible answers but is also competitive with state-of-the-art methods in multiple-choice tests.
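The reparameterization trick the model applies along its two pathways is standard in variational generative models; a minimal, self-contained sketch (not the paper's architecture) looks like this:

```python
import torch

# Minimal reparameterization-trick sketch: sample a latent code z from
# N(mu, sigma^2) while keeping the sampling step differentiable.
def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    std = torch.exp(0.5 * log_var)    # sigma recovered from log-variance
    eps = torch.randn_like(std)       # the noise carries the randomness
    return mu + eps * std             # gradients flow through mu and std

# KL term of a variational loss against a standard normal prior.
def kl_divergence(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
```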
Dialogic Learning in Child-Robot Interaction: A Hybrid Approach to Personalized Educational Content Generation
Malnatsky, Elena, Wang, Shenghui, Hindriks, Koen V., Ligthart, Mike E. U.
Dialogic learning fosters motivation and deeper understanding in education through purposeful and structured dialogues. Foundational models offer a transformative potential for child-robot interactions, enabling the design of personalized, engaging, and scalable interactions. However, their integration into educational contexts presents challenges in terms of ensuring age-appropriate and safe content and alignment with pedagogical goals. We introduce a hybrid approach to designing personalized educational dialogues in child-robot interactions. By combining rule-based systems with LLMs for selective offline content generation and human validation, the framework ensures educational quality and developmental appropriateness. We illustrate this approach through a project aimed at enhancing reading motivation, in which a robot facilitated book-related dialogues.
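One possible shape for such a hybrid pipeline is sketched below; the rule table, prompt wording, and validation flow are hypothetical illustrations of the offline-generation-plus-human-validation idea, not the authors' actual system.

```python
# Schematic sketch of a hybrid rule-based + LLM pipeline (all names
# are hypothetical; the paper's rules and tooling are not shown).
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    prompt: str
    utterance: str
    approved: bool = False

RULES = {
    "opening": "Ask the child which book they are reading this week.",
    "followup": "Ask an open question about the child's favourite character.",
}

def generate_offline(llm, rules=RULES):
    """Rule-selected prompts -> LLM drafts, generated offline in batch."""
    return [DialogueTurn(p, llm(f"Write one child-friendly robot line: {p}"))
            for p in rules.values()]

def human_validate(turns):
    """A human reviewer approves each turn before deployment; only
    approved content ever reaches the child-robot interaction."""
    for t in turns:
        t.approved = input(f"Approve? [y/n] {t.utterance!r} ") == "y"
    return [t for t in turns if t.approved]
```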
An Autoencoder-Like Nonnegative Matrix Co-Factorization for Improved Student Cognitive Modeling
Student cognitive modeling (SCM) is a fundamental task in intelligent education, with applications ranging from personalized learning to educational resource allocation. By exploiting students' response logs, SCM aims to predict their exercise performance as well as estimate knowledge proficiency in a subject. Data mining approaches such as matrix factorization can obtain high accuracy in predicting student performance on exercises, but the knowledge proficiency is unknown or poorly estimated. The situation is further exacerbated if only sparse interactions exist between exercises and students (or knowledge concepts). To solve this dilemma, we root monotonicity (a fundamental psychometric theory on educational assessments) in a co-factorization framework and present an autoencoder-like nonnegative matrix co-factorization (AE-NMCF), which improves the accuracy of estimating the student's knowledge proficiency via an encoder-decoder learning pipeline.
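For intuition, a toy masked nonnegative matrix factorization of sparse student response logs is sketched below; AE-NMCF's encoder-decoder co-factorization and monotonicity constraints go well beyond this, so treat it only as the baseline idea the paper builds on, with random placeholder data.

```python
import numpy as np

# Toy masked NMF for student response logs (illustrative; not AE-NMCF).
rng = np.random.default_rng(1)
R = rng.random((50, 30))            # students x exercises, scores in [0, 1]
M = rng.random((50, 30)) < 0.3      # mask: only ~30% interactions observed
K = 5                               # latent knowledge concepts

U = rng.random((50, K))             # student proficiency (nonnegative)
V = rng.random((30, K))             # exercise loadings (nonnegative)
for _ in range(200):                # multiplicative updates, observed cells
    P = U @ V.T
    U *= ((M * R) @ V) / np.maximum((M * P) @ V, 1e-9)
    V *= ((M * R).T @ U) / np.maximum((M * P).T @ U, 1e-9)

pred = U @ V.T                      # predicted performance on all exercises
```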
Rendering Transparency to Ranking in Educational Assessment via Bayesian Comparative Judgement
Gray, Andy, Rahat, Alma, Lindsay, Stephen, Pearson, Jen, Crick, Tom
Ensuring transparency in educational assessment is increasingly critical, particularly post-pandemic, as demand grows for fairer and more reliable evaluation methods. Comparative Judgement (CJ) offers a promising alternative to traditional assessments, yet concerns remain about its perceived opacity. This paper examines how Bayesian Comparative Judgement (BCJ) enhances transparency by integrating prior information into the judgement process, providing a structured, data-driven approach that improves interpretability and accountability. BCJ assigns probabilities to judgement outcomes, offering quantifiable measures of uncertainty and deeper insights into decision confidence. By systematically tracking how prior data and successive judgements inform final rankings, BCJ clarifies the assessment process and helps identify assessor disagreements. Multi-criteria BCJ extends this by evaluating multiple learning outcomes (LOs) independently, preserving the richness of CJ while producing transparent, granular rankings aligned with specific assessment goals. It also enables a holistic ranking derived from individual LOs, ensuring comprehensive evaluations without compromising detailed feedback. Using a real higher education dataset with professional markers in the UK, we demonstrate BCJ's quantitative rigour and ability to clarify ranking rationales. Through qualitative analysis and discussions with experienced CJ practitioners, we explore its effectiveness in contexts where transparency is crucial, such as high-stakes national assessments. We highlight the benefits and limitations of BCJ, offering insights into its real-world application across various educational settings.
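The probabilistic core of CJ-style ranking can be sketched with a Bradley-Terry model under a Gaussian prior on script quality, which yields both a ranking and quantifiable judgement probabilities; this illustrates the general idea, not the authors' BCJ implementation.

```python
import numpy as np

# Bradley-Terry sketch with a Gaussian prior on latent script quality.
# comparisons[k] = (i, j) means the assessor preferred script i over j.
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]
n = 3
theta = np.zeros(n)                 # latent quality, prior N(0, 1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for _ in range(500):                # MAP estimate by gradient ascent
    grad = -theta                   # gradient of the Gaussian log-prior
    for i, j in comparisons:
        p = sigmoid(theta[i] - theta[j])
        grad[i] += 1 - p            # d log-likelihood / d theta_i
        grad[j] -= 1 - p
    theta += 0.05 * grad

print("ranking:", np.argsort(-theta))
print("P(script 0 beats script 2):", sigmoid(theta[0] - theta[2]))
```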